To address the shortcomings of single-network classification models, this paper applies a CNN-LSTM (convolutional neural network-long short-term memory) combined network to music emotion classification and proposes a multifeature combined network classifier based on CNN-LSTM, which combines 2D (two-dimensional) feature input processed by the CNN-LSTM with 1D (one-dimensional) feature input processed by a DNN (deep neural network) to compensate for the deficiencies of the original single-feature models. The model uses multiple convolution kernels in the CNN for 2D feature extraction and a BiLSTM (bidirectional LSTM) for sequence processing, and is applied separately to audio and lyrics for single-modal emotion classification. For audio feature extraction, the music audio is finely segmented and the human voice is separated to obtain pure background-sound clips, from which the spectrogram and LLDs (low-level descriptors) are extracted. For lyrics feature extraction, a chi-squared test vector and word embeddings extracted by Word2vec are used, respectively, as the feature representations of the lyrics. Combining the two types of heterogeneous features selected from audio and lyrics through the classification model improves classification performance. To fuse the emotional information of the audio and lyrics modalities, this paper proposes a multimodal ensemble learning method based on stacking. Unlike existing feature-level and decision-level fusion methods, this method avoids the information loss caused by direct dimensionality reduction: the original features are converted into label results before fusion, effectively addressing the problem of feature heterogeneity. Experiments on the Million Song Dataset show that the audio classification accuracy of the proposed multifeature combined network classifier reaches 68% and the lyrics classification accuracy reaches 74%. The average multimodal classification accuracy reaches 78%, a significant improvement over the single-modal results.
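The following is a minimal sketch of the multifeature combined architecture described above: parallel convolutions with multiple kernel sizes over a 2D spectrogram, a BiLSTM over the resulting time sequence, and a DNN branch over a 1D LLD vector, fused before the classifier head. All shapes and layer sizes here (128-bin spectrograms, a 40-dimensional LLD vector, four emotion classes) are illustrative assumptions, not the paper's reported configuration.

```python
# Hypothetical sketch of the CNN-BiLSTM + DNN multifeature classifier.
# Spectrogram size, LLD dimension, class count, and layer widths are assumed.
import torch
import torch.nn as nn

class MultiFeatureCombinedNet(nn.Module):
    def __init__(self, n_mels=128, lld_dim=40, n_classes=4):
        super().__init__()
        # 2D branch: parallel convolutions with multiple kernel sizes
        # over the spectrogram (treated as a 1-channel image).
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=(k, k), padding=k // 2),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d((n_mels // 4, None)),  # shrink freq axis, keep time
            )
            for k in (3, 5, 7)
        ])
        conv_feat = 16 * 3 * (n_mels // 4)
        # BiLSTM serializes the pooled conv features along the time axis.
        self.bilstm = nn.LSTM(conv_feat, 64, batch_first=True, bidirectional=True)
        # 1D branch: a small DNN over the LLD statistics vector.
        self.dnn = nn.Sequential(nn.Linear(lld_dim, 64), nn.ReLU())
        self.head = nn.Linear(2 * 64 + 64, n_classes)

    def forward(self, spec, lld):
        # spec: (batch, 1, n_mels, time); lld: (batch, lld_dim)
        maps = torch.cat([c(spec) for c in self.convs], dim=1)
        b, ch, f, t = maps.shape
        seq = maps.permute(0, 3, 1, 2).reshape(b, t, ch * f)  # (batch, time, features)
        _, (h, _) = self.bilstm(seq)
        rnn_out = torch.cat([h[-2], h[-1]], dim=1)  # final forward + backward states
        return self.head(torch.cat([rnn_out, self.dnn(lld)], dim=1))

# Usage: logits = MultiFeatureCombinedNet()(torch.randn(2, 1, 128, 100), torch.randn(2, 40))
```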
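The stacking-based fusion can be sketched as follows, assuming two already-extracted feature matrices `X_audio` and `X_lyrics` with shared labels `y`. In the paper the base learners are the single-modal networks above; scikit-learn estimators are substituted here for brevity, and the choice of base and meta-classifiers is purely illustrative. The key point matches the abstract: each modality is reduced to label probabilities, so the heterogeneous feature spaces are never concatenated directly.

```python
# Hypothetical sketch of stacking-based multimodal fusion.
# cross_val_predict yields out-of-fold probabilities so the meta-classifier
# never sees base-model predictions made on their own training data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def stack_modalities(X_audio, X_lyrics, y, n_folds=5):
    audio_clf = RandomForestClassifier(n_estimators=200, random_state=0)
    lyrics_clf = RandomForestClassifier(n_estimators=200, random_state=0)
    # Convert each modality's raw features into class-probability "label results".
    p_audio = cross_val_predict(audio_clf, X_audio, y, cv=n_folds,
                                method="predict_proba")
    p_lyrics = cross_val_predict(lyrics_clf, X_lyrics, y, cv=n_folds,
                                 method="predict_proba")
    meta_X = np.hstack([p_audio, p_lyrics])  # fusion happens at the label level
    meta_clf = LogisticRegression(max_iter=1000).fit(meta_X, y)
    # Refit the base models on all data for use at inference time.
    audio_clf.fit(X_audio, y)
    lyrics_clf.fit(X_lyrics, y)
    return audio_clf, lyrics_clf, meta_clf
```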